10 research outputs found
Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms
The reliable fraction of information is an attractive score for quantifying
(functional) dependencies in high-dimensional data. In this paper, we
systematically explore the algorithmic implications of using this measure for
optimization. We show that the problem is NP-hard, which justifies the usage of
worst-case exponential-time as well as heuristic search methods. We then
substantially improve the practical performance for both optimization styles by
deriving a novel admissible bounding function that has an unbounded potential
for additional pruning over the previously proposed one. Finally, we
empirically investigate the approximation ratio of the greedy algorithm and
show that it produces highly competitive results in a fraction of time needed
for complete branch-and-bound style search.
Comment: Accepted to Proceedings of the IEEE International Conference on Data Mining (ICDM'18).
Discovering robust dependencies from data
Science revolves around forming hypotheses, designing experiments, collecting data, and running tests. It was not until recently, with the advent of modern hardware and data analytics, that science shifted towards a big-data-driven paradigm that led to unprecedented success across various fields. Perhaps the most astounding feature of this new era is that interesting hypotheses can now be automatically discovered from observational data. This dissertation investigates knowledge discovery procedures that do exactly this. In particular, we seek algorithms that discover the most informative models able to compactly "describe" aspects of the phenomena under investigation, in both supervised and unsupervised settings. We consider interpretable models in the form of subsets of the original variable set. We want the models to capture all possible interactions (e.g., linear, non-linear) between all types of variables (e.g., discrete, continuous), and lastly, we want their quality to be meaningfully assessed. For this, we employ information-theoretic
measures, and particularly, the fraction of information for the supervised setting, and the normalized total correlation for the unsupervised. The former measures the uncertainty reduction of the target variable conditioned on a model, and the latter measures the information overlap of the variables included in a model.
Without access to the true underlying data generating process, we estimate the aforementioned measures from observational data. This process is prone to statistical errors, and in our case, the errors manifest as biases towards larger models. This can lead to situations where the results are utterly random, thereby hindering
further analysis. We correct this behavior with notions from statistical learning theory. In particular, we propose regularized estimators that are unbiased under the hypothesis of independence, leading to robust estimation from limited data samples and arbitrary dimensionalities. Moreover, we do this for models
consisting of both discrete and continuous variables. Lastly, to discover the top scoring models, we derive effective optimization algorithms for exact, approximate, and heuristic search. These algorithms are
powered by admissible, tight, and efficient-to-compute bounding functions for our proposed estimators that can be used to greatly prune the search space. Overall, the products of this dissertation can successfully assist data analysts with data exploration, discovering powerful description models, or concluding that
no satisfactory models exist, thereby implying that new experiments and data are required for the phenomena under investigation. This statement is supported by Materials Science researchers who corroborated our discoveries.
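For reference, the two scores described above can be written as follows. The formulas are standard, but the particular normalizer for total correlation shown here is only one common choice and should be read as an assumption rather than the dissertation's exact definition.

```latex
% Fraction of information: relative reduction in uncertainty about the
% target Y once the model (variable subset) X is known
F(X \rightarrow Y) = \frac{I(X;Y)}{H(Y)} = \frac{H(Y) - H(Y \mid X)}{H(Y)}

% Total correlation of X = (X_1, \ldots, X_d): the information overlap of
% the variables in the model, normalized by an upper bound so the score
% lies in [0, 1] (the normalizer below is an assumed common choice)
C(X) = \frac{\sum_{i=1}^{d} H(X_i) - H(X_1, \ldots, X_d)}
            {\sum_{i=1}^{d} H(X_i) - \max_{i} H(X_i)}
```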
Discovering Reliable Correlations in Categorical Data
In many scientific tasks we are interested in discovering whether there exist any correlations in our data. This raises many questions, such as how to reliably and interpretably measure correlation between a multivariate set of attributes, how to do so without having to make assumptions on the distribution of the data or the type of correlation, and how to efficiently discover the top-most reliably correlated attribute sets from data. In this paper we answer these questions for discovery tasks in categorical data.
In particular, we propose a corrected-for-chance, consistent, and efficient estimator for normalized total correlation, by which we obtain a reliable, naturally interpretable, non-parametric measure for correlation over multivariate sets. For the discovery of the top-k correlated sets, we derive an effective algorithmic framework based on a tight bounding function. This framework offers exact, approximate, and heuristic search. Empirical evaluation shows that already for small sample sizes the estimator leads to low-regret optimization outcomes, while the algorithms are shown to be highly effective for both large and high-dimensional data. Through two case studies we confirm that our discovery framework identifies interesting and meaningful correlations
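As an illustration of the "corrected-for-chance" idea, the sketch below estimates total correlation with a plug-in estimator and subtracts a permutation baseline that destroys all real dependence, leaving only estimation bias. This is a generic bias correction written for this listing, not the consistent estimator proposed in the paper; all function names are illustrative.

```python
import math
import random
from collections import Counter

def entropy(col):
    """Plug-in (maximum-likelihood) entropy of a discrete sample, in bits."""
    n = len(col)
    return -sum(c / n * math.log2(c / n) for c in Counter(col).values())

def total_correlation(cols):
    """Plug-in total correlation: sum of marginal entropies minus joint entropy."""
    joint = list(zip(*cols))
    return sum(entropy(c) for c in cols) - entropy(joint)

def corrected_total_correlation(cols, permutations=100, seed=0):
    """Corrected-for-chance estimate: subtract the mean total correlation
    obtained after independently shuffling each column.  Shuffling destroys
    any real dependence, so what remains is (approximately) the bias of the
    plug-in estimator under independence."""
    rng = random.Random(seed)
    raw = total_correlation(cols)
    bias = 0.0
    for _ in range(permutations):
        shuffled = [rng.sample(c, len(c)) for c in cols]
        bias += total_correlation(shuffled)
    return raw - bias / permutations
```

On independent columns the corrected estimate stays near zero even for small samples, which is exactly the behavior the plug-in estimator lacks.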
Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms (Extended Abstract)
Finding (functional) dependencies between attributes in databases is a well-known problem with applications in knowledge discovery, feature selection, and database management. While the recently introduced reliable fraction of information measure makes it possible to soundly quantify dependence in a way that avoids overfitting when optimizing over high-dimensional spaces, the algorithmic implications of using this score have not yet been systematically explored. This includes the computational complexity of the resulting optimization problem.
To this end, this paper provides the following contributions: We show that the problem of maximizing the reliable fraction of information is NP-hard, which justifies the usage of worst-case exponential-time as well as heuristic search methods that do not guarantee optimal solutions. We then greatly improve the practical performance for both of these optimization styles by deriving a novel admissible bounding function, which has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate for the first time the approximation ratio of the greedy algorithm and show that in fact it produces highly competitive results in a fraction of time needed for complete branch-and-bound style search. All findings are evaluated on a wide range of real-world datasets that are publicly available along with the implementation of the algorithmic contributions.
Our results suggest that in scenarios where no hard optimality guarantees are required, greedy optimization is a good alternative to branch-and-bound for dependency discovery. Also, the definition of the tighter bounding function is potentially more generally applicable than just to the reliable fraction of information and might be transferable to other dependency measures.
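The two search styles compared above can be sketched generically: greedy forward selection, and a branch-and-bound that prunes any branch whose admissible (optimistic) bound cannot beat the incumbent. The skeleton below is illustrative; `score` and `bound` are placeholders for the reliable fraction of information and its bounding function, which are not reproduced here.

```python
def greedy(candidates, score, k):
    """Greedy forward selection: repeatedly add the attribute that most
    improves the score.  Fast, but offers no optimality guarantee."""
    selected = []
    for _ in range(k):
        best = max((a for a in candidates if a not in selected),
                   key=lambda a: score(selected + [a]), default=None)
        if best is None or score(selected + [best]) <= score(selected):
            break  # no attribute improves the score any further
        selected.append(best)
    return selected

def branch_and_bound(candidates, score, bound):
    """Exhaustive subset search that skips any subtree whose admissible
    bound (an upper bound on every score reachable in the subtree) does
    not exceed the best score found so far."""
    best_set, best_val = [], score([])
    stack = [([], list(candidates))]
    while stack:
        current, remaining = stack.pop()
        val = score(current)
        if val > best_val:
            best_set, best_val = list(current), val
        for i, a in enumerate(remaining):
            child, rest = current + [a], remaining[i + 1:]
            if bound(child, rest) > best_val:  # admissible: never underestimates
                stack.append((child, rest))
    return best_set, best_val
```

With a toy additive score (each attribute contributes a fixed weight) and the obvious optimistic bound, both searches recover the best subset; a tighter bound directly translates into fewer stack pushes, which is the pruning effect the paper quantifies.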
Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms
The reliable fraction of information is an attractive score for quantifying (functional) dependencies in high-dimensional data. In this paper, we systematically explore the algorithmic implications of using this measure for optimization. We show that the problem is NP-hard, which justifies the usage of worst-case exponential-time as well as heuristic search methods. We then substantially improve the practical performance for both optimization styles by deriving a novel admissible bounding function that has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate the approximation ratio of the greedy algorithm and show that it produces highly competitive results in a fraction of time needed for complete branch-and-bound style search.
Universal Dependency Analysis
Finding patterns from binary data is a classical problem in data mining, dating back to at least frequent itemset mining. More recently, approaches such as tiling and Boolean matrix factorization (BMF) have been proposed to find sets of patterns that aim to explain the full data well. These methods, however, are not robust against non-trivial destructive noise, i.e. when relatively many 1s are removed from the data: tiling can only model additive noise, while BMF assumes approximately equal amounts of additive and destructive noise. Most real-world binary datasets, however, exhibit mostly destructive noise. In presence/absence data, for instance, it is much more common to fail to observe something than it is to observe a spurious presence. To address this problem, we take the recent approach of employing the Minimum Description Length (MDL) principle for BMF and introduce a new algorithm, Nassau, that directly optimizes the description length of the factorization instead of the reconstruction error. In addition, unlike the previous algorithms, it can adjust the factors it has discovered during its search. Empirical evaluation on synthetic data shows that Nassau excels at datasets with high destructive noise levels, and its performance on real-world datasets confirms our hypothesis that real-world data contains high numbers of missing observations.
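The shift from reconstruction error to description length can be illustrated with a naive two-part MDL score for a Boolean factorization A ≈ B ∘ C: bits to transmit the factor matrices plus bits to transmit which cells of the reconstruction are wrong. The code lengths below are simple illustrative choices, not the encoding Nassau actually uses.

```python
import math

def log2_binom(n, k):
    """log2 of the binomial coefficient C(n, k), via log-gamma."""
    ln2 = math.log(2)
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / ln2

def bool_product(B, C):
    """Boolean matrix product: (B o C)[i][j] = OR_k (B[i][k] AND C[k][j])."""
    n, r, m = len(B), len(C), len(C[0])
    return [[int(any(B[i][k] and C[k][j] for k in range(r)))
             for j in range(m)] for i in range(n)]

def description_length(A, B, C):
    """Naive two-part MDL score of a Boolean factorization A ~ B o C:
    one bit per factor-matrix cell, plus the cost of the error count and
    an index into all error patterns with that many flipped cells.
    Lower is better; a model that merely fits noise pays for it in the
    error term, unlike a pure reconstruction-error objective."""
    n, m = len(A), len(A[0])
    recon = bool_product(B, C)
    errors = sum(A[i][j] != recon[i][j] for i in range(n) for j in range(m))
    bits_factors = sum(len(row) for row in B) + sum(len(row) for row in C)
    bits_errors = math.log2(n * m + 1) + log2_binom(n * m, errors)
    return bits_factors + bits_errors
```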
Dabigatran Associated Leukocytoclastic Vasculitis
Common side effects of dabigatran are bleeding, bruising, nausea, diarrhea, and abdominal discomfort. Skin reactions are rarely noted (<0.1%). We report a case of a 70-year-old male who developed a dabigatran-related skin reaction resistant to usual therapy. Skin biopsy revealed leukocytoclastic vasculitis.